Data Lab 4 - Expanding Descriptive Comparisons: Medicaid and Health

Building on our previous analysis of the 2023 BRFSS data, we’re going to expand some of our descriptive comparisons between those with Medicaid coverage and the uninsured in this Data Lab. By incorporating additional demographic and health-related variables, we’ll explore systematic differences between these two groups, motivating the need to explore adjusted comparisons in future analyses.

Step 1: Importing and Subsetting the Data

Open the R Project file that you created for Data Lab 3. If you saved an R Script file after completing Data Lab 3, you can open that now. Otherwise, you can open a new R script file for this Data Lab.

Like we did last time, we’ll need to load the 2023 BRFSS survey data. If you have the code for this in your R Script file, that’s great. If not, you can grab the code from Data Lab 3. Regardless, we’ll need to make a slight modification to the code that will allow us to load a few additional variables.

library(haven)
library(dplyr)
brfss_data <- read_xpt("path_to_file/LLCP2023.XPT")
brfss_smaller <- select(brfss_data, PRIMINS1, GENHLTH, `_AGE80`, SEXVAR, INCOME3, EDUCA, SMOKE100, SMOKDAY2, ALCDAY4, DRNK3GE5)

Remember to change the “path_to_file” line above to the actual path to the saved “LLCP2023.XPT” file on your computer. If you receive an error message telling you that the file does not exist, first double check that you typed the correct path. Then check whether the saved file’s name has an extra space after the .XPT file extension, something like “LLCP2023.XPT ”. If it does have that extra space, you’ll either need to delete it from the file name or add a space after .XPT in the read_xpt command.
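If the file still can’t be found, it can help to look at the file names exactly as R sees them. The helper below is only a sketch: the function name show_xpt_files is made up for this illustration, and “path_to_file” is still a placeholder for your own folder. It wraps each file name in brackets so that a trailing space stands out.

```r
# Hypothetical helper (not part of haven or dplyr): list the files in a folder
# whose names contain "XPT", wrapped in brackets so a trailing space is visible.
show_xpt_files <- function(folder) {
  files <- list.files(folder, pattern = "XPT")
  if (length(files) == 0) {
    return(character(0))  # nothing found: wrong folder, or the file is elsewhere
  }
  paste0("[", files, "]")
}

# Replace "path_to_file" with the folder where you saved the data.
show_xpt_files("path_to_file")
```

A result like “[LLCP2023.XPT ]” (with a gap before the closing bracket) means the file name ends in a space. An empty result means R found no matching file in that folder.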

Notice that we’ve added two new variables when constructing the brfss_smaller data frame since the last Data Lab - SMOKDAY2 and DRNK3GE5. Take a look at the codebook to review the survey questions that correspond to these variables and the values the variables can take on.

Step 2: Cleaning the Data

Just like in Data Lab 3, we’ll need to do some cleaning of the data to keep only those who have Medicaid coverage or are uninsured and recode sex as female.

brfss_clean <- brfss_smaller %>%
  filter(PRIMINS1 %in% c(5, 88),
         `_AGE80` < 65) %>%
  mutate(FEMALE = ifelse(SEXVAR == 2, 1, 0), 
         MEDICAID = ifelse(PRIMINS1 == 5, 1, 0))

We’re using the filter command to restrict the sample to those with Medicaid coverage (PRIMINS1=5) or those with no insurance coverage (PRIMINS1=88) and those under age 65. The mutate command allows us to create two new variables: an indicator for female sex and an indicator for Medicaid coverage.
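To see what filter and mutate are doing, it may help to run the same code on a tiny data frame where you can check every row by eye. The five respondents below are made up for illustration and are not real BRFSS records.

```r
library(dplyr)

# Made-up respondents for illustration only (not real BRFSS records).
toy <- data.frame(
  PRIMINS1 = c(1, 5, 88, 5, 88),
  `_AGE80` = c(40, 30, 50, 70, 25),
  SEXVAR   = c(1, 2, 1, 2, 2),
  check.names = FALSE  # keep the column name `_AGE80` exactly as written
)

toy_clean <- toy %>%
  filter(PRIMINS1 %in% c(5, 88),   # keep Medicaid (5) or uninsured (88)
         `_AGE80` < 65) %>%        # and keep only those under age 65
  mutate(FEMALE = ifelse(SEXVAR == 2, 1, 0),
         MEDICAID = ifelse(PRIMINS1 == 5, 1, 0))

toy_clean
```

The first respondent is dropped because PRIMINS1 is neither 5 nor 88, and the fourth is dropped because of age, so three rows remain.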

Next we need to define smoking status. This is tricky because of the way the BRFSS collects information on cigarette smoking. Take a look at the dataset to see what I mean by typing brfss_smaller into the Console command line. Notice that the “SMOKE100” variable is coded as either 1 or 2, while the “SMOKDAY2” variable contains a bunch of “NA” values. The reason for this is that the BRFSS first asks respondents whether they’ve smoked at least 100 cigarettes in their lifetime. If they respond “yes”, the survey then asks whether they currently smoke every day, some days, or not at all. But if they responded “no” to the “SMOKE100” question (SMOKE100=2), then they’re not asked about their current smoking status. Hence the “NA” values for “SMOKDAY2” when “SMOKE100” is equal to 2.

For our purposes, we want to know whether someone currently smokes or not. We’ll count both everyday smokers and some day smokers as smokers, and we’ll count both former smokers and never smokers as nonsmokers. Here’s how we can do that:

brfss_clean <- brfss_clean %>%
  mutate(SMOKER = case_when(
    SMOKDAY2 %in% c(1,2) ~ 1,
    SMOKE100 == 2 | SMOKDAY2 == 3 ~ 0,
    SMOKDAY2 %in% c(7,9) | SMOKE100 %in% c(7,9) ~ NA_real_
    )
  )

In this code, we’re creating a new variable called “SMOKER” that is equal to 1 if someone is a current everyday or some day smoker (SMOKDAY2=1 or 2) and is equal to 0 if someone never smoked or used to smoke but doesn’t any longer (SMOKE100=2 or SMOKDAY2=3). Note that the | character means “or” and the ~ character assigns the value to the new “SMOKER” variable. Also note that we’re taking any cases where either “SMOKE100” or “SMOKDAY2” is equal to 7 (“Don’t Know/Not Sure”) or 9 (“Refused”) and setting those to “NA” (missing).
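One detail worth checking is why the second condition still works for never smokers, whose “SMOKDAY2” value is NA. In R, TRUE | NA is TRUE, so SMOKE100 == 2 is enough on its own. The sketch below runs the same case_when on six made-up respondents, one for each situation the code has to handle.

```r
library(dplyr)

# How "or" behaves when one side is missing:
TRUE | NA    # TRUE: one side is already TRUE, so the missing side doesn't matter
FALSE | NA   # NA: the answer depends on the missing side

# Made-up respondents for illustration only (not real BRFSS records).
toy <- data.frame(
  SMOKE100 = c(1, 1, 1, 2, 7, 1),
  SMOKDAY2 = c(1, 2, 3, NA, NA, 9)
)

toy <- toy %>%
  mutate(SMOKER = case_when(
    SMOKDAY2 %in% c(1, 2) ~ 1,                                # current smoker
    SMOKE100 == 2 | SMOKDAY2 == 3 ~ 0,                        # never or former smoker
    SMOKDAY2 %in% c(7, 9) | SMOKE100 %in% c(7, 9) ~ NA_real_  # don't know / refused
  ))

toy
table(toy$SMOKER, useNA = "ifany")
```

case_when checks the conditions in order and uses the first one that is TRUE. A condition that evaluates to NA counts as not matched, and a row that matches nothing ends up as NA.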

We can check to make sure that our SMOKER variable is coded correctly by typing table(brfss_clean$SMOKER) into the Console command line. You should see that 10,062 respondents are current smokers and 33,586 are nonsmokers. But what about those people we coded as “NA” (missing)? To see the number of respondents we coded as “NA” you can type table(brfss_clean$SMOKER, useNA = "ifany") into the Console command line. You should see that there are 2,991 respondents for whom we weren’t able to assign smoking status.

Now let’s create an indicator called “BINGE” that is equal to 1 if a respondent reports at least one instance of binge drinking in the past 30 days and is equal to 0 otherwise. The codebook tells us that binge drinking is defined as having 5 or more drinks for men or 4 or more drinks for women on a single occasion. Like the question on current smoking status, the question on binge drinking is only asked of respondents who reported drinking at least one alcoholic beverage during the past 30 days (ALCDAY4>=101 & ALCDAY4<=299). So to create an indicator for binge drinking, we’ll need to use the “ALCDAY4” variable along with the “DRNK3GE5” variable. See if you can figure out how to properly code “BINGE” (remember to account for those who haven’t had an alcoholic beverage in the past 30 days, ALCDAY4=888, those who answered “don’t know/not sure”, ALCDAY4=777 or DRNK3GE5=77, and those who refused to answer the question, ALCDAY4=999 or DRNK3GE5=99).

Once you’ve created the “BINGE” variable, use the table command to make sure it’s coded correctly. You should see that 7,473 respondents reported at least one instance of binge drinking over the past 30 days, 35,019 reported no instances of binge drinking over the past 30 days, and 4,147 respondents have missing values for “BINGE”.

Step 3: Revisiting the Relationship between Medicaid and Self-Reported Health

Before exploring other differences between Medicaid recipients and the uninsured, let’s revisit what we found in Data Lab 3. In Data Lab 3, we compared self-reported health between those with Medicaid coverage and the uninsured. We found that the uninsured appeared to report better health:

round(prop.table(table(brfss_clean$GENHLTH, brfss_clean$MEDICAID), margin = 2) * 100, 2)

11.26% of Medicaid recipients reported excellent health, compared to 15.89% of the uninsured. At the other end of the scale, 9.79% of Medicaid recipients reported poor health, compared to only 4.70% of the uninsured.
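The margin = 2 option is what makes these column percentages, that is, percentages within each insurance group. Here is a small sketch with made-up values (eight respondents, not BRFSS data) that shows the difference between margin = 2 and margin = 1.

```r
# Made-up values for illustration: 4 uninsured (0) and 4 Medicaid (1) respondents.
health   <- c(1, 1, 2, 3, 1, 2, 3, 3)
medicaid <- c(0, 0, 0, 0, 1, 1, 1, 1)

counts <- table(health, medicaid)

# margin = 2 divides each count by its column total, so each column
# (each insurance group) sums to 100.
col_pct <- round(prop.table(counts, margin = 2) * 100, 2)
col_pct
colSums(col_pct)

# margin = 1 would instead divide by row totals, so each health rating sums to 100.
row_pct <- round(prop.table(counts, margin = 1) * 100, 2)
rowSums(row_pct)
```

Because we want to compare the two insurance groups with each other, column percentages are the ones we need here.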

brfss_clean %>%
  group_by(MEDICAID) %>%
  summarize(mean_health = mean(GENHLTH[!GENHLTH %in% c(7, 9)], na.rm = TRUE))

The mean self-reported health score (where higher scores indicate worse health) was 3.01 for Medicaid recipients and 2.76 for the uninsured. At first glance, these results might suggest that being uninsured is better for your health than having Medicaid! In fact, as we’ll see in class next week, people have used similar comparisons to make that exact claim!
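The expression GENHLTH[!GENHLTH %in% c(7, 9)] in the code above drops the “don’t know” (7) and “refused” (9) codes before averaging. A quick sketch with made-up responses shows why that matters.

```r
# Made-up responses: 1-5 are health ratings, 9 means "Refused", NA is missing.
genhlth <- c(1, 2, 3, 9, NA)

# Keep only the entries that are not 7 or 9, then drop NA when averaging.
mean_clean <- mean(genhlth[!genhlth %in% c(7, 9)], na.rm = TRUE)
mean_clean   # 2

# Leaving the 9 in treats "Refused" as if it were a very poor health rating.
mean_raw <- mean(genhlth, na.rm = TRUE)
mean_raw     # 3.75
```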

But let’s take a closer look. One explanation for what we’re seeing is that Medicaid is worse for your health than being uninsured, but another explanation could be that Medicaid recipients and the uninsured differ in important ways that affect health. Let’s examine education, income, smoking, and binge drinking to determine whether these factors could confound the relationship between Medicaid and self-reported health.

Education Differences

Since higher education is linked to better health outcomes, let’s see if Medicaid recipients and the uninsured differ in education levels.

round(prop.table(table(brfss_clean$EDUCA, brfss_clean$MEDICAID), margin = 2) * 100, 2)

Notice that 76.42% of the uninsured are high school graduates compared to 86.54% of Medicaid enrollees. Similarly, Medicaid enrollees are more likely to have a college degree compared to the uninsured (18.99% vs. 17.93%). So there do seem to be differences in educational attainment between those with Medicaid and the uninsured, but the differences favor Medicaid enrollees and so probably don’t explain the results we’re seeing for self-reported health.

Income Differences

Like education, higher income is associated with better health. Let’s look to see whether income differs between Medicaid enrollees and the uninsured.

round(prop.table(table(brfss_clean$INCOME3, brfss_clean$MEDICAID), margin = 2) * 100, 2)

You’ll have to go back to the codebook to see how the BRFSS categorizes household income to interpret these results, but once you do, you’ll see something interesting. Those with Medicaid coverage tend to have lower incomes, on average, than the uninsured. For example, nearly 56% of Medicaid enrollees in the data have household incomes less than $35,000 compared to only 39% of those who are uninsured.

So maybe it’s not Medicaid coverage itself that is responsible for the lower self-rated health of Medicaid enrollees. Maybe it’s the fact that Medicaid enrollees tend to have lower household incomes than the uninsured. Let’s keep going.

Behavioral Differences

Next, let’s take a look at potential behavioral differences between Medicaid enrollees and the uninsured and see whether any of these differences might help explain the differences we’re seeing in self-reported health. This time, let’s include a visualization so it’s easy to see how Medicaid enrollees and the uninsured differ in their smoking and drinking behaviors.

First, we’ll need to calculate the mean smoking and binge drinking rates for each group using the summarize function:

brfss_summary <- brfss_clean %>%
  mutate(MEDICAID = factor(MEDICAID, levels = c(0, 1), labels = c("Uninsured", "Medicaid"))) %>% 
  group_by(MEDICAID) %>%
  summarize(Smoking_Rate = mean(SMOKER, na.rm = TRUE) * 100,
            Binge_Drinking_Rate = mean(BINGE, na.rm = TRUE) * 100)

Here we’re creating a new data frame called “brfss_summary” that contains the mean smoking and binge drinking rates for each group. We’re also explicitly telling R that “MEDICAID” is a factor variable (this will be important for our visualization below), and labeling the values of “MEDICAID” as either “Uninsured” (MEDICAID=0) or “Medicaid” (MEDICAID=1).

This is great, but the brfss_summary data frame that we just created is not in tidy format (type brfss_summary into the Console command line to see this). We’ll want to convert it to tidy format for creating our visualization. So let’s do this:

library(tidyr)  # pivot_longer() comes from the tidyr package

brfss_summary <- brfss_summary %>%
  pivot_longer(cols = c(Smoking_Rate, Binge_Drinking_Rate), 
               names_to = "Behavior", values_to = "Rate")

Now we have a data frame with three variables: MEDICAID (an indicator for Medicaid coverage), Behavior (either smoking or binge drinking), and Rate (the mean value for each group and behavior). We can create a chart that shows the differences in smoking and binge drinking rates by group using R’s ggplot command.
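If the reshaping step feels abstract, here is a small standalone sketch of what pivot_longer does. The wide data frame is typed in by hand using the rounded rates reported in this lab, so you can run it without the BRFSS file.

```r
library(tidyr)

# Wide layout: one row per group, one column per behavior. The rates are typed
# in by hand from the figures reported in this lab, rounded to one decimal.
wide <- data.frame(
  MEDICAID = c("Uninsured", "Medicaid"),
  Smoking_Rate = c(21.1, 24.7),
  Binge_Drinking_Rate = c(21.4, 14.4)
)

# Long (tidy) layout: one row per group-behavior pair.
long <- pivot_longer(wide,
                     cols = c(Smoking_Rate, Binge_Drinking_Rate),
                     names_to = "Behavior", values_to = "Rate")
long
```

The two rows and two rate columns become four rows, with the old column names stored in “Behavior” and the numbers stored in “Rate”. That one-value-per-row layout is what ggplot expects.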

library(ggplot2)

ggplot(brfss_summary, aes(x = Behavior, y = Rate, fill = MEDICAID)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("red", "blue")) +
  labs(title = "Smoking and Binge Drinking Rates by Insurance Status",
       x = "Behavior",
       y = "Percentage of Respondents",
       fill = "Insurance Status") +
  theme_minimal()

ggplot is one of the most common ways to create visualizations in R. Let’s go through this code line by line. The first line:

ggplot(brfss_summary, aes(x = Behavior, y = Rate, fill = MEDICAID)) +

says that we’re going to create a plot using the brfss_summary data frame. The aes option allows us to tell R which variables we want on the x-axis and the y-axis, and that we want to color the bars based on Medicaid status. The term aes stands for “aesthetic”.

The next line:

geom_col(position = "dodge") +

tells R to create a bar chart (geom_col) and to put the bars next to each other instead of stacking them (position = "dodge"). To see why adding the position = "dodge" option is important, you can re-run the code above removing position = "dodge" from the parentheses.
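As a quick way to compare the two layouts side by side, the sketch below builds both versions of the chart from a small hand-typed data frame (the rounded rates reported in this lab), so it runs on its own. The object names p_dodge and p_stack are just labels chosen for this illustration.

```r
library(ggplot2)

# Rates typed in by hand from the figures reported in this lab.
rates <- data.frame(
  MEDICAID = factor(c("Uninsured", "Uninsured", "Medicaid", "Medicaid"),
                    levels = c("Uninsured", "Medicaid")),
  Behavior = c("Smoking_Rate", "Binge_Drinking_Rate",
               "Smoking_Rate", "Binge_Drinking_Rate"),
  Rate = c(21.1, 21.4, 24.7, 14.4)
)

# Side-by-side bars: each group's rate can be read directly off the y-axis.
p_dodge <- ggplot(rates, aes(x = Behavior, y = Rate, fill = MEDICAID)) +
  geom_col(position = "dodge")

# Default position ("stack"): the two groups' bars are piled on top of each
# other, so the total bar height is the sum of the two rates.
p_stack <- ggplot(rates, aes(x = Behavior, y = Rate, fill = MEDICAID)) +
  geom_col()

p_dodge
p_stack
```

In the stacked version the bar heights add the two groups together, which isn’t a meaningful number here. That’s why we dodge the bars.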

The line after that:

scale_fill_manual(values = c("red", "blue")) +

tells R that we want to manually choose which colors to use to fill the bars in the plot and that we want those colors to be red and blue (you can experiment with other colors if you’d like).

The next few lines:

labs(title = "Smoking and Binge Drinking Rates by Insurance Status",
    x = "Behavior",
    y = "Percentage of Respondents",
    fill = "Insurance Status") +

are just adding labels to the visualization so that it’s easier to understand what the graph is showing.

Finally, theme_minimal() is just a stylistic touch to make things look a little nicer. Try running the code without that and you’ll see the difference.

So let’s take a second to evaluate the data in the visualization. It looks like Medicaid enrollees are more likely to smoke than those who are uninsured (24.7% vs. 21.1%), but are far less likely to report binge drinking (14.4% vs. 21.4%). Again, it’s not entirely clear how differences in these behaviors might relate to differences in self-reported health. But these results, combined with the differences in education and income that we noted earlier, suggest that factors other than insurance coverage could be confounding simple comparisons of Medicaid coverage and self-reported health.

Summary and Key Takeaways

In Data Lab 3, we observed that the uninsured reported better health than those with Medicaid coverage. In this lab, we’ve taken a closer look at other key differences between these groups that might explain this pattern. We’ll continue to explore this dynamic in future Data Labs and problem sets to try to get a better sense of the relationship between Medicaid coverage and health.